Providing On-line Access to Portuguese Language Resources: Corpora and Lexicons

Authors

  • Maria Fernanda Bacelar do Nascimento
  • Amália Mendes
  • Luísa Pereira
Abstract

Several Language Resources (LRs) for Portuguese, developed at the Center of Linguistics of the Lisbon University (CLUL), are available on-line at CLUL’s webpage: www.clul.ul.pt/english/sectores/projecto_rld.html. These LRs have been extracted from, or developed on the basis of, the Reference Corpus of Contemporary Portuguese (CRPC), a monitor corpus that at present contains more than 300 million words. The corpus is sampled from several types of written text (literary, newspaper, technical, didactic, juridical, parliamentary, etc.) and spoken text (informal and formal), covering national and regional varieties of Portuguese (including European, Brazilian, African and Asian Portuguese).

The LRs available on-line include: a) several subcorpora (written and spoken, tagged and untagged) compiled and extracted from the CRPC for specific CLUL projects and now available for on-line queries; b) a published sample of “Português Fundamental”, a spoken CRPC subcorpus, available for text download; c) a frequency lexicon extracted from a CRPC subcorpus, available for both on-line queries and download. Other LRs available for Portuguese are also mentioned: C-ORAL-ROM Integrated Reference Corpora for Spoken Romance Languages, a CD-ROM edition of a spoken corpus with text-to-sound alignment; the LE-PAROLE corpus; the LE-PAROLE Lexicon and the SIMPLE Lexicon.

1. Institutions that have given financial support to the CRPC: Fundação Calouste Gulbenkian, Junta Nacional de Investigação Científica e Tecnológica (JNICT) Programme Estímulo em Ciências Sociais e Humanas, Fundação para a Ciência e a Tecnologia (FCT) Fundos Programáticos, Instituto Camões, União Latina, Caixa Geral de Depósitos, Comissão das Comunidades Europeias LE-PAROLE Project. A network of public and private institutions supplies data for the CRPC (http://www.clul.ul.pt/english/frames.html).

ON-LINE QUERIES TO EUROPEAN PORTUGUESE CORPORA

Some CRPC subcorpora have been developed under specific projects at CLUL.
These resources are available for on-line queries at CLUL’s webpage, using CLUL’s concordancer CONCOR, adapted to run on the Internet, as well as CLUL’s lemmatiser. When searching for a lemma or a wordform, the user can choose the corpus, request concordances or frequencies, sort the concordance results, set the context length and obtain bibliographic references. The corpora available for on-line queries are the following:
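
As an illustration of the query options described above (wordform search, concordances or frequencies, sorting, context length), the following is a minimal Python sketch of a keyword-in-context search. It is not CONCOR's implementation: the function names `tokenize`, `concordance` and `frequency`, the sample sentence and the sorting scheme are all assumptions made for this example, and lemma search is left out because it would require a lemmatiser.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into word tokens, keeping hyphenated forms together."""
    return re.findall(r"\w+(?:-\w+)*", text)


def concordance(tokens, query, context=3, sort_by="position"):
    """Return (left, keyword, right) tuples for each case-insensitive match.

    `context` is the number of tokens kept on each side of the keyword.
    `sort_by` may be "position" (corpus order), "left" or "right".
    Sorting uses plain code-point order, not Portuguese collation.
    """
    q = query.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == q:
            left = tokens[max(0, i - context):i]
            right = tokens[i + 1:i + 1 + context]
            hits.append((left, tok, right))
    if sort_by == "right":
        hits.sort(key=lambda h: [w.lower() for w in h[2]])
    elif sort_by == "left":
        # Compare the word nearest to the keyword first.
        hits.sort(key=lambda h: [w.lower() for w in reversed(h[0])])
    return hits


def frequency(tokens, query):
    """Count case-insensitive occurrences of a wordform."""
    return Counter(t.lower() for t in tokens)[query.lower()]


# Hypothetical sample text, not taken from the CRPC.
sample = ("O corpus contém textos escritos e o corpus oral contém "
          "gravações; cada corpus é anotado.")
tokens = tokenize(sample)
hits = concordance(tokens, "corpus", context=2, sort_by="right")
for left, kw, right in hits:
    print(f"{' '.join(left):>20} [{kw}] {' '.join(right)}")
print("frequency:", frequency(tokens, "corpus"))
```

Keeping the left and right contexts as token lists, rather than joined strings, makes it straightforward to sort on either side of the keyword and to change the context length per query.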

Similar resources

Some Language Resources and Tools for Computational Processing of Portuguese at INESC

In the last few years, automatic processing tools and corpus-based studies have become of great importance for the community. The possibility of evaluating and developing such tools and studies depends on the availability of language resources. For the Portuguese language in its several national varieties, these resources are not enough to meet the community's needs. In this paper some valuab...

UNITEX-PB, a set of flexible language resources for Brazilian Portuguese

This work documents the project and development of various computational linguistic resources that support the Brazilian Portuguese language according to the formal methodology used by the corpus processing system called UNITEX. The delivered resources include computational lexicons, libraries to access compressed lexicons, and additional tools to validate those resources.

Providing Internet Access to Portuguese Corpora: the AC/DC Project

In this paper we report on the activity of the project Computational Processing of Portuguese (Processamento computacional do português) with regard to providing access to Portuguese corpora through the Internet. One of its activities, the AC/DC project (Acesso a corpora/Disponibilização de Corpora, roughly "Access and Availability of Corpora"), allows a user to query around 40 million words o...

Automatic processing of multilingual medical terminology: applications to thesaurus enrichment and cross-language information retrieval

OBJECTIVES: We present in this article experiments on multi-language information extraction and access in the medical domain. For such applications, multilingual terminology plays a crucial role when working on specialized languages and specific domains. MATERIAL AND METHODS: We propose firstly a method for enriching multilingual thesauri which extracts new terms from parallel corpora, and seco...

A Rule Based Pronunciation Generator and Regional Accent Databank for Portuguese

One of the major obstacles in deploying spoken language technologies (SLTs) in the developing world is a lack of key linguistic resources – e.g. electronic dictionaries, phonetically aligned corpora, pronunciation lexicons, etc. – that describe the non-dominant varieties spoken in such countries and regions. In this paper, we describe the work of the LUPo (Portuguese Unisyn Lexicon) project to ...

Publication date: 2004